ABSTRACT

In message passing programs, once a process terminates with an unexpected error, the terminated process can
propagate the error to the remaining processes through communication dependencies, resulting in a program failure.
Therefore, to locate faults, developers must identify the group of processes involved in the original error and the faulty
processes that activate faults. This paper presents a novel debugging tool, named MPI-PreDebugger (MPI-PD), for
localizing faulty processes in message passing programs. MPI-PD automatically distinguishes the original and the
propagated errors by checking communication errors during program execution. If MPI-PD observes any communication
errors, it backtraces communication dependencies and points out potentially faulty processes in a timeline
view. We also introduce three case studies in which MPI-PD played a key role in debugging.
From these studies, we believe that MPI-PD helps developers locate faults and allows them to concentrate on
correcting their programs.

KEYWORDS: parallel processing; message passing; debugging; fault localization




1 Introduction 



In recent years, cluster/grid computing [Buy99, FK98] has emerged as a cost-effective methodology for high
performance computing. The message passing paradigm [Mes94] is a widely employed programming paradigm
that yields efficient parallel programs in these computing environments.

However, debugging message passing programs is usually time-consuming, since we have to investigate a
large amount of debugging information compared to sequential programs. Furthermore, once a process terminates
with an unexpected error [MSR77], the terminated process can propagate the error to the remaining processes
through communication dependencies. For example, if a process terminates before sending an intended message,
the receiver process, which has no fault of its own, also terminates, since it fails to receive the expected message.
This error propagation makes it complicated to locate the hidden faults among a number of observed errors.

To give developers valuable insights for debugging, a number of debugging tools have been developed
for message passing programs. Post-mortem performance debuggers such as ParaGraph [HE91], ATEMPT
[KGV96], XMPI [LAM02], and Vampir [Pal99] visualize a detailed timeline view of communications, so that
developers can intuitively understand program behaviors.

Source-level debuggers such as TotalView [Etn02], MPIGDB [BGL00], and CDB [WCS02] allow stepwise
execution of programs. TotalView also has a visualization facility, named Message Queue Graph (MQG),
which shows the states of the pending send and receive operations. MPIGDB is based on a sequential debugger,
GDB [SPS02], and allows developers to broadcast terminal input to all GDB processes attached to computing
processes. CDB also provides a similar debugging environment by employing GDB at its lower layer.

Fault localization [JHS02] is another approach for debugging programs. Relative debugging [HJ00, MWA01]
is a kind of fault localization for programs that have been ported from sequential to parallel architectures or
between different parallel architectures. It dynamically compares data between two executing programs, so
that it can locate errors in the compared programs. In [NBDK96], Netzer et al. have pointed out that unforeseen
consequences of bugs can cause messages to arrive in unexpected orders. Their algorithm dynamically locates
errors by detecting unintended nondeterminism, or race conditions.

Process grouping [Kra02b, Kun93, SNdK00] is a fundamental technique for scalable visualization and debugging.
DeWiz [Kra02a, Kra02b] aims at identifying closely related processes and reducing the amount of
trace data. Given a specific process, DeWiz isolates the related processes according to the accumulated length
of transmitted messages.

Thus, a number of tools provide useful debugging functions. However, developers still struggle to select
the original error from a number of observed errors, which include both original and propagated errors. Once the original
error is given to developers, they can immediately investigate faults by using existing debuggers and concentrate
on correcting them.

In this paper, we propose a novel debugging tool, named MPI-PreDebugger (MPI-PD), for localizing faulty
processes in message passing programs. The current MPI-PD supports programs written using the Message Passing
Interface (MPI) standard [Mes94] and focuses on faults that terminate program execution. MPI-PD aims at
reducing the workload developers face when localizing faulty processes in a timeline visualization.

To achieve this, MPI-PD dynamically checks communication errors in accordance with the error definition 
in a program execution model. If MPI-PD observes any communication errors, it then generates a trace file, 
backtraces communication dependencies and points out potentially faulty processes in a timeline view. Thus, 
MPI-PD reduces the amount of debugging information before developers visualize and investigate it by using 
performance debuggers and source-level debuggers. 

The rest of this paper is organized as follows. Section 2 formally characterizes communication errors in
MPI programs and makes clear the differences among faults, errors, and failures. Section 3 gives an algorithm
for localizing faulty processes in a given trace file, while Section 4 presents MPI-PD, which implements the
proposed algorithm. Section 5 introduces three case studies assisted by MPI-PD. Finally, Section 6 concludes
this paper.

2 Modeling Behavior of Message Passing Programs 

This section gives a definition of communication errors in MPI programs. We define it by extending the
program execution model described in [NM92].

2.1 Event graph: program execution model 

An execution of a message passing program is defined as a directed graph, $G = (E, \rightarrow)$, where $E$ represents a
finite set of events while $\rightarrow$ represents the happened-before relation [Lam78] defined over $E$ [NM92]. In the
following, we call this directed graph the event graph [Kra02a].

An event in this context represents the execution instance of a set of consecutively executed statements in
some process [NM92]. Any event $e \in E$ is observed during a program execution. In the following, let $e_{p,i}$ be
the $i$-th event on process $p$.

The happened-before relation $\rightarrow$ shows how events potentially affect one another [Lam78]. This relation is
defined as the irreflexive transitive closure of the union of two other relations: $\rightarrow = (\xrightarrow{S} \cup \xrightarrow{C})^{+}$. Here, $\xrightarrow{S}$ and
$\xrightarrow{C}$ respectively represent the sequential order relation and the concurrent order relation as follows [Kra02a]:

[Figure 1 panels: (a) Blocking communication; (b) Nonblocking communication]
Figure 1: Order relations between events. A node represents an event and an arrow represents a relation.

Sequential order relation, $\xrightarrow{S}$: As illustrated in Figure 1(a), the sequential order of events, $e_{p,i} \xrightarrow{S} e_{p,i+1}$,
defines that the $i$-th event $e_{p,i}$ on any sequential process $p$ occurred before the $(i+1)$-st event $e_{p,i+1}$.

Concurrent order relation, $\xrightarrow{C}$: As illustrated in Figure 1(a), the concurrent order of events, $e_{p,i} \xrightarrow{C} e_{q,j}$,
defines that the $i$-th event $e_{p,i}$ on any process $p$ occurred directly before the $j$-th event $e_{q,j}$ on another process
$q$, if $e_{p,i}$ is the sending of a message by process $p$ and $e_{q,j}$ is the receipt of the same message by another
process $q$.

Although the event graph is a sufficient model for visualizing the behavior of message passing programs,
we have to add one relation to this graph to characterize the errors relevant to nonblocking communications
[Mes94]. This additional relation exists between a pair of events caused by the initiation and the completion of
a nonblocking send/receive operation:

Nonblocking order relation, $\xrightarrow{N}$: As illustrated in Figure 1(b), the nonblocking order relation, $\xrightarrow{N}$, shows the
order in which nonblocking messages are initiated and then completed: $e_{p,i} \xrightarrow{N} e_{p,k}$ defines that $e_{p,i} \rightarrow e_{p,k}$,
if $e_{p,i}$ is the send/receive initiation of a message by process $p$ and $e_{p,k}$ is the completion of the same
message by the same process $p$.

In our extended event graph, the happened-before relation is redefined as $\rightarrow = (\xrightarrow{S} \cup \xrightarrow{C} \cup \xrightarrow{N})^{+}$.
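
To make the closure concrete, the following is a minimal Python sketch, not part of MPI-PD itself, that computes the happened-before relation from the three order relations. The function name `happened_before`, the representation of events as `(process, index)` pairs, and the tiny example trace are all assumptions made for illustration.

```python
from itertools import product

def happened_before(events, seq, conc, nonblk):
    """Return the transitive closure of seq | conc | nonblk as a set of pairs."""
    closure = set(seq) | set(conc) | set(nonblk)
    # Warshall-style closure: the intermediate event k is the outermost loop.
    for k in events:
        for a, b in product(events, repeat=2):
            if (a, k) in closure and (k, b) in closure:
                closure.add((a, b))
    return closure

# Hypothetical trace: events are (process, index) pairs.
events = [("p", 0), ("p", 1), ("q", 0), ("q", 1)]
seq = {(("p", 0), ("p", 1)), (("q", 0), ("q", 1))}  # sequential order
conc = {(("p", 1), ("q", 0))}                       # p's send -> q's receive
hb = happened_before(events, seq, conc, set())
```

For an acyclic event graph such as this one, the result contains no pair of the form (e, e), so the closure is irreflexive as the definition requires.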



2.2 Fault, error, and failure 

The concepts of faults, errors, and failures [MSR77] used in our discussion are briefly explained as follows: a
program with a bug has a fault in itself and an active fault causes an error. If the error fails to be corrected, it
causes a failure.

[Figure 2 labels: failure event, error event, faulty process, faulty event; processes p, q, and r]

Figure 2: Fault, error, and failure events. A crossed node represents an unexpectedly terminated event, while a
dotted node represents an expected event that did not occur.

Figure 2 shows an example that interprets these three concepts on events. In this example, process r is
the faulty process, since it executes a faulty statement and causes a faulty event. It also terminates against
the developer's intention, which causes a failure event. After this, process q fails to pass a message to process
r, which causes an error event and results in a failure event (since q terminates). Process p also encounters a
communication error; however, its error handler prevents a failure.

Let is_failed($e$) denote whether event $e$ causes a failure or not. Since failure events have no successor and
occur when programs unexpectedly terminate, is_failed($e$) is defined as follows:

is_failed($e$) = ($\nexists\, e' \in E$ such that $e \rightarrow e'$) $\wedge$ (the program terminated unexpectedly).

2.3 Communication errors in MPI programs 

In MPI programs, an event causes a communication error if it satisfies one of the following two conditions,
isolated or truncated, defined as follows:

• Isolated events.

- An event $e_{p,i}$ ($e_{q,j}$) is called an isolated send (receive) event if $\nexists\, e_{q,j} \in E$ ($\nexists\, e_{p,i} \in E$) such that
$e_{p,i} \xrightarrow{C} e_{q,j}$, respectively [Kra02a].

- An event $e_{p,i}$ ($e_{p,k}$) is called an isolated send/receive initiation (completion) event if $\nexists\, e_{p,k} \in E$
($\nexists\, e_{p,i} \in E$) such that $e_{p,i} \xrightarrow{N} e_{p,k}$, respectively.

• Truncated events.

- Two events $e_{p,i}$ and $e_{q,j}$ are called truncated events if $e_{p,i} \xrightarrow{C} e_{q,j}$ and len($e_{p,i}$) > len($e_{q,j}$), where
len($e_{p,i}$) and len($e_{q,j}$) represent the length of the send buffer specified in event $e_{p,i}$ and the receive
buffer specified in event $e_{q,j}$, respectively.

Isolated events arise in the following two situations. One is the mismatch of occurred events and
the other is the non-occurrence of expected events. First, occurred but mismatched events can trigger an
error propagation. For example, an MPI routine call with an invalid tag/communicator [Mes94] or an invalid
source/destination rank fails to pass the intended message. A similar mismatch can occur between the initiation
and the completion of a nonblocking send/receive operation. Next, expected but non-occurred events cause
serious problems, since they can propagate errors through all processes. For example, if a process terminates
before sending an intended message, the receiver process, which has no fault of its own, also terminates, since it fails
to receive the expected message. Thus, isolated events propagate errors in a manner similar to the domino effect, leading
to a program failure.

A pair of truncated events indicates an occurrence of an overflow at the receive buffer. In a strict sense, a
message should be passed between the send and the receive operations with the same buffer length [Kra02a].
However, as MPI does, we also permit passing a message between events $e_{p,i}$ and $e_{q,j}$ such that $e_{p,i} \xrightarrow{C} e_{q,j}$ and
len($e_{p,i}$) < len($e_{q,j}$). In practice, some nondeterministic applications require this flexibility, because the receiver
processes in these applications want to receive a variable length message at one receive operation. Therefore,
we permit passing a message between events with different buffer lengths except for truncated events.

Thus, the error of an event can depend on that of an event on another process. In this paper we say that
processes $p$ and $q$ have a communication dependency if the error of event $e_{p,i}$ on process $p$ determines that of
event $e_{q,j}$ on another process $q$.

Notice here that MPI has four communication modes [Mes94]: the standard, buffered, synchronous, and
ready modes. These modes differ in when they resolve the matching of outgoing messages. For example, when
two processes send a message to each other, they fall into a deadlock in the synchronous mode while they
are deadlock-free in the buffered mode. Therefore, we have to check communication errors without destroying
these communication semantics in the target programs. That is, outgoing messages have to be checked in the
same mode as their original mode. The error detection mechanism employed in MPI-PD is presented later in
Section 4.2.

For collective communications, since they can be implemented by using point-to-point communications, 
we repeatedly apply the above error definition to all of the point-to-point messages that compose the collective 
communication. 

In the following, let is_isolated($e_{p,i}$) denote whether event $e_{p,i}$ is an isolated event or not. Also, let
is_truncated($e_{p,i}$, $e_{q,j}$) denote whether events $e_{p,i}$ and $e_{q,j}$ are truncated events or not.
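
The two predicates can be sketched in a few lines of Python. This is an illustrative sketch only, restricted to blocking send/receive pairs: the `Event` record, its field names, and the representation of matched messages as a set of `(send, recv)` pairs are assumptions, not MPI-PD data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    proc: int     # process rank
    index: int    # position of the event on its process
    kind: str     # "send" or "recv"
    length: int   # buffer length passed to the MPI call

def is_isolated(event, conc):
    """True if no matched (send, recv) pair in conc involves the event."""
    return not any(event in pair for pair in conc)

def is_truncated(send, recv, conc):
    """True if send and recv match and the send buffer is the longer one."""
    return (send, recv) in conc and send.length > recv.length

# Hypothetical events: s and r match, lonely has no partner.
s = Event(0, 0, "send", 8)
r = Event(1, 0, "recv", 4)
lonely = Event(2, 0, "recv", 4)
conc = {(s, r)}
```

Here `s` and `r` form a truncated pair because the send buffer (8) exceeds the receive buffer (4), while `lonely` is an isolated receive event.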

3 Algorithm for Localizing Faulty Processes



This section presents the details of our proposed algorithm. We describe how to localize faulty processes in a
given event graph. We assume here that the event graph has already been generated by the error detection mechanism
presented later in Section 4.2.

     2. // Input:  P, a set of process ranks.
     3. //         G, an event graph.
     4. // Output: P_e, a set of localized faulty process ranks.
    16.   P_e := ∅;
    17.   foreach (p ∈ P) begin
    18.     if (BacktraceCommDep(p, ∅) ≠ 0) then P_e := P_e ∪ {p};   // Process p has faults.
    19.   end
    20. end
    21. // A recursive function that backtraces communication dependencies from process p.
    22. function BacktraceCommDep(p, P_dep)
    23. begin
    24.   if ((p ∈ P_e) || ((fe_p = null) && (P_dep = ∅))) then return 0;   // p is already traced or valid.
    25.   else if (fe_p is a calculation event) then return -1;   // (a) Calculation fault.
    26.   else if (fe_p = null) then return -2;   // (b) Non-occurred event.
    27.   else if (p ∈ P_dep) then return p;   // (c) Deadlock or (d) Overflow.
    28.   endif
    29.   q := ptnr(fe_p);   // Source/destination rank for fe_p
    30.   Q_dep := P_dep ∪ {p};   // Update the call history.
    31.   retval := BacktraceCommDep(q, Q_dep);
    32.   if (retval ≠ 0) then P_e := P_e ∪ {q};   // Process q has faults.
    33.   if (retval = p) then retval := 0;
    34.   else if (retval < 0) then retval++;
    35.   endif
    36.   return retval;
    37. end

Figure 3: Algorithm for localizing faulty processes.



Figure 3 shows our algorithm, which requires a set of process ranks, $P$, and an event graph, $G$, and returns
sets of localized faulty processes and the failure events on each process, $P_e$ and $E_e$, respectively. Our algorithm
consists of two stages as follows:

• Identification of failure events (see lines 7-14 in Figure 3).

• Localization of faulty processes (see lines 15-37 in Figure 3).

At the first stage, the algorithm identifies all failure events. After this stage, it localizes faulty processes by 
backtracing communication dependencies in a recursive manner. Our algorithm then classifies program failure 
into the following four situations: 

(a) Calculation fault: Figure 4(a) illustrates this situation. As a result of backtracing, our algorithm finds
that process s terminates unexpectedly and has no communication dependency on any other process.
Therefore, the algorithm determines that the faulty process is process s, which causes a calculation fault.

[Figure 4 panels: (a) Calculation fault; (b) Non-occurred event; (c) Deadlock (deadlock among processes q-s); (d) Overflow]

Figure 4: Four failure situations classified by the proposed algorithm.



(b) Non-occurred event: Figure 4(b) illustrates this situation, in which process s has a communication dependency
from r but terminates successfully. In this situation, either process r sent
a redundant message or process s failed to call a receive routine. However, it seems to be
difficult to automatically identify which of processes r and s is faulty. Therefore, our algorithm
determines that both processes r and s are faulty, that is, the process left waiting by a normally terminated
process and the normally terminated process itself.

(c) Deadlock: A deadlock occurs if there exists a cyclic communication dependency. In Figure 4(c), processes
q, r and s fall into a deadlock. Our algorithm determines that the faulty processes are all the processes
that participate in the deadlock.

(d) Buffer overflow: In Figure 4(d), process s causes a buffer overflow. As in situation (b), it seems
to be difficult to identify which of processes r and s has called an MPI routine with an invalid buffer
length. Therefore, our algorithm determines that both processes r and s are faulty, since they
have a pair of truncated events.

Notice that the algorithm described in Figure 3 backtraces communication dependencies by assuming that
all the source/destination ranks are valid. Therefore, if a faulty process calls an MPI routine with an invalid
source/destination, this algorithm can omit the faulty process from the localized processes. We discuss this
problem later in Section 5.1.
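
The recursive backtracing stage can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: it assumes the failure events have already been identified and are supplied as a hypothetical `failure` dictionary mapping each rank to `None` (no failure event) or to a record with a `kind` (`"calc"` or `"comm"`) and a `partner` rank. Unlike the pseudocode, which returns the rank itself to signal a cycle, the sketch returns a `("cycle", rank)` tuple so that rank 0 cannot be confused with the "valid" return value 0.

```python
VALID = 0  # nothing further to report to the caller

def backtrace(p, deps, failure, faulty):
    """Follow communication dependencies from rank p; deps is the call history."""
    fe = failure.get(p)
    if p in faulty or (fe is None and not deps):
        return VALID                      # already traced, or valid
    if fe is not None and fe["kind"] == "calc":
        return -1                         # (a) calculation fault
    if fe is None:
        return -2                         # (b) non-occurred event
    if p in deps:
        return ("cycle", p)               # (c) deadlock or (d) overflow
    q = fe["partner"]
    ret = backtrace(q, deps | {p}, failure, faulty)
    if ret != VALID:
        faulty.add(q)                     # partner q has faults
    if ret == ("cycle", p):
        return VALID                      # the cycle is closed at p
    if isinstance(ret, int) and ret < 0:
        return ret + 1                    # count down toward VALID
    return ret

def localize(ranks, failure):
    """Return the set of ranks flagged as faulty."""
    faulty = set()
    for p in ranks:
        if backtrace(p, frozenset(), failure, faulty) != VALID:
            faulty.add(p)
    return faulty

# Hypothetical inputs for three of the four situations.
calc_case = {0: {"kind": "comm", "partner": 1},
             1: {"kind": "comm", "partner": 2},
             2: {"kind": "calc", "partner": None}}
missing_case = {0: {"kind": "comm", "partner": 1}, 1: None}
deadlock_case = {0: {"kind": "comm", "partner": 1},
                 1: {"kind": "comm", "partner": 2},
                 2: {"kind": "comm", "partner": 3},
                 3: {"kind": "comm", "partner": 1}}
```

In the deadlock input, rank 0 merely waits on the cycle formed by ranks 1, 2, and 3, so only the three cycle members are reported.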



4 MPI-PreDebugger

This section presents the details of MPI-PD, including its debugging environment and its mechanism for
run-time error detection.

4.1 Overview of debugging environment

Figure 5 shows the debugging process with MPI-PD. The debugging functions in MPI-PD are implemented
using the C++ language and the Ruby-GNOME toolkit [Rub02] and are composed of three components: the
instrument tool mpi2pd, the run-time error detection library libpdmpi.a, and the localization and visualization tool pdview.

The instrument tool mpi2pd automatically replaces all of the MPI routines in programs with instrumented
MPI routines based on pattern-match rules. The instrumented routine is a combination of the original MPI
routine and the run-time error detection function. After this replacement, developers have to generate the object
code by compiling their programs and the executable binary file by linking the object code with the run-time
error detection library.

Figure 5: Debugging process with MPI-PD.



The run-time error detection library checks communication errors whenever the processes call the instrumented
MPI routines (see Section 4.2). If the library detects any communication error, it terminates program
execution and generates a trace file. The trace file has the following information for every event observed during
program execution: (1) event number, (2) process rank, (3) corresponding line in source code and its file name,
and (4) corresponding MPI routine and its arguments.

Given a trace file, the visualization tool pdview allows developers to view the behavior of the terminated
program, as shown in Figure 5. It visualizes the event graph, with the process axis drawn vertically and the time
axis drawn horizontally, and shows the result of the fault localization described in Section 3. In the event graph, a
colored node corresponds to an event, and the type of the MPI operation that caused the event decides its color.
A solid line between two nodes corresponds to a successful communication while a dotted line corresponds to
a failed communication.

In default mode, pdview avoids visualizing the entire event graph. It visualizes all of the failure events that occurred
on each process and the successful events that occurred directly before the failure events. Furthermore, pdview can
isolate faulty processes from the event graph. Developers can visualize an isolated event graph by selecting
whichever processes they want. In addition to these visualization functions, pdview also shows the following
information:



• Faulty processes localized by the proposed algorithm. 

• Failure situation selected from four situations (see Figure 4).

Furthermore, developers can investigate every visualized event. If they click on a node in the
visualized event graph, then pdview pops up a dialog, which shows information (1)-(4) about the corresponding
event and its error reason (isolated/truncated). This information is useful for developers to locate faults in
programs. After this fault localization, source-level debuggers can effectively assist developers in investigating
the detailed behavior of the localized part.

4.2 Mechanism for run-time error detection 

MPI-PD checks the occurrence of communication errors during program execution. If it detects any errors, it 
generates a trace file. 

To realize this, we employ three methodologies. We first discuss the synchronous blocking send (MPI_Ssend)
and then the others. The three methodologies are as follows:

• Manager process: To generate trace files even under a deadlock situation, we employ a manager process $M_p$ for
every process $p$. $M_p$ checks the value of is_failed($e_{p,i}$) before its responsible process $p$ executes event
$e_{p,i}$; how this value is determined is described in the states below. If $M_p$ obtains is_failed($e_{p,i}$) =
false, it allows $p$ to execute event $e_{p,i}$ and pushes the information about $e_{p,i}$ into its local event graph
$E_p$. Otherwise, it detects a communication error, terminates $p$, and generates a trace file from $E_p$.

• Message queue: To handle nonblocking communications, we employ a message queue. For nonblocking
communications, to decide the failure of completion event $e_{p,k}$, we have to refer to the information about its
corresponding initiation event $e_{p,i}$ ($e_{p,i} \xrightarrow{N} e_{p,k}$). Therefore, for all processes $p$, manager $M_p$ has its own
message queue $Q_p$ for referring to the information about past events.

• Timeout mechanism: We also employ a timeout mechanism due to the difficulty in distinguishing
valid from invalid computation. For example, a receive event $e_{q,j}$ that never receives a message has to
be judged as is_isolated($e_{q,j}$) = true. However, it is hard for $M_q$ to identify whether the sender $p$ will send
the message or not. That is, $p$ may send the message after heavy computation or may have fallen into an infinite
loop. Therefore, $M_p$ holds a timeout $t(e_{p,i})$ for every $e_{p,i}$ and decides is_isolated($e_{p,i}$) = true when
the time is up.
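
The message queue and timeout mechanism can be combined in a small data structure. The sketch below is an assumption-laden illustration in Python: the class name `PendingQueue`, its methods, and the use of string event identifiers are hypothetical, and the clock is injectable only so that the behaviour can be checked without waiting.

```python
import time

class PendingQueue:
    """Per-manager queue of unmatched events, each with a deadline."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.deadlines = {}  # event id -> absolute deadline

    def push(self, event_id):
        """Register an event that is still waiting for its counterpart."""
        self.deadlines[event_id] = self.clock() + self.timeout_s

    def remove(self, event_id):
        """Drop an event once it has been matched."""
        self.deadlines.pop(event_id, None)

    def expired(self):
        """Events whose deadline has passed; these are treated as isolated."""
        now = self.clock()
        return [e for e, d in self.deadlines.items() if d <= now]

# Usage with a fake clock so that time can be advanced by hand.
fake_now = [0.0]
queue = PendingQueue(5.0, clock=lambda: fake_now[0])
queue.push("recv on q, event 0")
before = queue.expired()
fake_now[0] = 6.0
after = queue.expired()
```

Choosing the timeout value is a trade-off: a short timeout can misjudge a sender that is merely busy with heavy computation, while a long one delays the detection of a genuinely isolated event.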

Figure 6 shows the process of run-time error detection for MPI_Ssend. In Figure 6, the manager of the
sender has three states (states C, S1 and S2) and that of the receiver has four states (states C, R1, R2 and R3)
as follows:



[Figure 6 panels: (a) Successful case; (b) Failure case (trace file generation). The panels show the send call, the receive call, the request and ack control messages, the original message, and the manager states C, S1, S2, R1, R2, and R3.]

Figure 6: Process of run-time error detection for the synchronous blocking send (MPI_Ssend). Events $e_{p,i}$
and $e_{q,j}$ correspond to MPI_Ssend and MPI_Recv calls, respectively.



Common state for the sender/receiver:

State C: Timeout checking and control-message waiting. In this state, $M_p$ keeps checking $Q_p$ for
timeout events until it receives a control message (an ack or a request message) from $p$ or
another manager. If $M_p$ detects a timeout event $e_{p,i}$, then it decides is_failed($e_{p,i}$) = true and sends an
abort request $abort_p(e_{p,i})$ to $p$. It also adds the failure event $e_{p,i}$ to $E_p$ and terminates. If $M_p$ receives a
control message, then it changes its state to the appropriate state.

States for the sender:

State S1: Send initiating. If $M_p$ receives a send request $req_p(e_{p,i})$ from $p$, then it pushes the information about
$e_{p,i}$ into $Q_p$ with $t(e_{p,i})$. It also checks the destination rank of $e_{p,i}$ and transmits a send request $req_m(e_{p,i})$
to the destination process's manager, $M_q$ (go to state C).

State S2: Message sending. If $M_p$ receives an ack $ack_m(e_{q,j})$ from another manager, then it searches $Q_p$ and
selects $e_{p,i}$ such that is_isolated($e_{p,i}$) = false. It also checks whether $e_{p,i}$ and $e_{q,j}$ are truncated events.

• If is_truncated($e_{p,i}$, $e_{q,j}$) = false, $M_p$ decides is_failed($e_{p,i}$) = false and sends an ack $ack_p(e_{p,i})$
to $p$. After this acknowledgement, it deletes $e_{p,i}$ from $Q_p$, and adds both $e_{p,i}$ and $e_{q,j}$ to $E_p$ (go to
state C).

• Otherwise, $M_p$ decides is_failed($e_{p,i}$) = true and sends an abort request $abort_p(e_{p,i})$ to $p$. It also
adds both $e_{p,i}$ and $e_{q,j}$ to $E_p$ as failure events and terminates.

States for the receiver:

State R1: Receive initiating. If $M_q$ receives a receive request $req_q(e_{q,j})$, it then searches $Q_q$ and selects $e_{p,i}$
such that is_isolated($e_{p,i}$) $\lor$ is_isolated($e_{q,j}$) = false.

• If such $e_{p,i}$ exists, $M_q$ decides that $e_{p,i}$ and $e_{q,j}$ are the matching events (go to state R3).

• Otherwise, it defers the error detection on $e_{q,j}$ and pushes the information about $e_{q,j}$ into $Q_q$ with
$t(e_{q,j})$ (go to state C).

State R2: Send-request receiving. If $M_q$ receives a request $req_m(e_{p,i})$ from another manager, then it searches
$Q_q$ and selects $e_{q,j}$ such that is_isolated($e_{p,i}$) $\lor$ is_isolated($e_{q,j}$) = false.

• If such $e_{q,j}$ exists, $M_q$ decides that $e_{p,i}$ and $e_{q,j}$ are the matching events (go to state R3).

• Otherwise, it defers the error detection on $e_{p,i}$ and pushes the information about $e_{p,i}$ into $Q_q$ with
$t(e_{p,i})$ (go to state C).

State R3: Message receiving. $M_q$ sends an ack $ack_m(e_{q,j})$ to $M_p$. It then checks whether $e_{p,i}$ and $e_{q,j}$ are truncated
events.

• If is_truncated($e_{p,i}$, $e_{q,j}$) = false, then $M_q$ decides is_failed($e_{q,j}$) = false and sends an ack
$ack_q(e_{q,j})$ to $q$. After this acknowledgement, it deletes $e_{q,j}$ ($e_{p,i}$) from $Q_q$ and adds both $e_{p,i}$ and
$e_{q,j}$ to $E_q$ (go to state C).

• If is_truncated($e_{p,i}$, $e_{q,j}$) = true, then $M_q$ decides is_failed($e_{q,j}$) = true and sends an abort request
$abort_q(e_{q,j})$ to $q$. It also adds both $e_{p,i}$ and $e_{q,j}$ to $E_q$ as failure events and terminates.
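
The receiver-side states R1 to R3 amount to matching whichever request arrives second against the one queued first, then checking for truncation. The Python sketch below illustrates this under simplifying assumptions: events are plain dictionaries, the hypothetical `key` field stands in for the full MPI matching criteria (source, tag, communicator), and the return strings `"wait"`, `"ack"`, and `"abort"` are labels invented for the example.

```python
def on_request(queue, event):
    """States R1/R2: match an incoming request against queued counterparts."""
    for other in queue:
        if other["kind"] != event["kind"] and other["key"] == event["key"]:
            queue.remove(other)  # safe: we return before iterating further
            send, recv = (event, other) if event["kind"] == "send" else (other, event)
            # State R3: a send buffer longer than the receive buffer is a truncation.
            return "abort" if send["length"] > recv["length"] else "ack"
    queue.append(event)  # no counterpart yet; keep waiting (state C)
    return "wait"

# Hypothetical exchange: the receive request arrives before the send request.
pending = []
first = on_request(pending, {"kind": "recv", "key": (0, 7), "length": 4})
second = on_request(pending, {"kind": "send", "key": (0, 7), "length": 4})
```

The same function covers both arrival orders, which mirrors how states R1 and R2 differ only in which request reaches the manager first.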

The manager processes buffer all events until they detect an error, so their local memory may
fill up. Our algorithm described in Figure 3 requires only the failure events on each process. Therefore, if the local memory of
$M_p$ is full, we allow $M_p$ to delete the information about the oldest successful event from $E_p$.

Here, recall that we have to keep the communication semantics, as explained in Section 2.3. Therefore,
for the blocking buffered mode send (MPI_Bsend), we alter the sequence of error detection. That is, to keep
the buffered behavior of message passing, process $p$ passes the original message immediately after sending
request $req_p(e_{p,i})$ to its manager $M_p$. This alteration omits receiving an ack $ack_p(e_{p,i})$ from $M_p$. To compensate for
this omission, $p$ checks for an abort message $abort_p(e_{p,i})$ from $M_p$ whenever it calls an instrumented MPI routine.
If $p$ receives the abort message $abort_p(e_{p,i})$, it terminates its execution. Otherwise, it continues processing the
original routine. This alteration allows $p$ to execute a few events after an original faulty event; however, there is
no influence on faulty process localization since $M_p$ identifies the faulty event correctly.

For nonblocking communications, we process states S1 and R1 at the send initiation and the receive initiation
of nonblocking operations, respectively, and process send acks at the completion of the nonblocking
operations. For collective communications, we can apply the same approach as for the blocking mode point-to-point
routines, since the collective communications can be implemented by using those point-to-point routines.

Thus, exchanging information about every event among managers enables us to detect communication 
errors and generate trace files before program failure. 

Table 1: Summary of case studies. |L|, |P|, and |E| represent the numbers of lines, processes, and events,
respectively.

                      Details of program                                      Details of trace file
    Case study        Developer   |L|      Employed MPI routines              |P|    |E|
    1. Applicability  Beginner    300      Send, Recv, Isend, Irecv, Wait     4      412
    2. Scalability    Expert      40,000   Send, Recv, Sendrecv               64     9,774
    3. Usability      Compiler    20,000   Isend, Irecv, Waitall              15     253



Table 2: Application results of MPI-PD.

                                  Number of programs
    Debugging phase               Success     Failure
    MPI program execution         13 of 28    15 of 28
    Event graph visualization     15 of 15    0 of 15
    Faulty process localization   12 of 15    3 of 15



5 Case Studies: Debugging Message Passing Programs with MPI-PD 

In this section we introduce three case studies. The aim of each study is to investigate the effectiveness of
MPI-PD from the following points of view:

1. Applicability: We investigated the kinds of faults for which MPI-PD is effective. To do this, we applied
MPI-PD to a few tens of Gaussian elimination programs developed by MPI beginners (see Section 5.1).

2. Scalability: This study shows an example of scalable debugging using MPI-PD. We applied MPI-PD to
a parallel rendering program [TIH03] developed by MPI experts on 64 processes (see Section 5.2).

3. Usability: We investigated the usability of faulty process localization. To do this, we applied MPI-PD to a 
complicated program generated automatically by a parallelizing compiler [YTFI02]. We also compared
visualization results between the proposed MPI-PD and the existing TotalView [Etn02] (see Section 5.3).

Table 1 shows a summary of the above studies. In the following, we omit "MPI_", the prefix of MPI
routines, as shown in Table 1.

In these studies we used a PC cluster with 64 symmetric multiprocessor (SMP) nodes. Each node in the
cluster has two Pentium III 1GHz processors and connects to a Myrinet-2000 switch [BCF+95]. We also
employed an MPI implementation, MPICH-GM [Myr02].

5.1 Study 1: Applicability of MPI-PD 

In this study, we applied MPI-PD to 28 faulty programs developed by six graduate students through a practice 
in MPI programming. These programs solve simultaneous equations using Gaussian elimination. 

We first executed the programs on our PC cluster and then visualized localization results by using MPI-PD.
Table 2 shows the application results at each debugging phase.

At the execution phase, 15 of 28 programs unexpectedly terminated. As we mentioned in Section 1, since
the current MPI-PD focuses on faults that cause program failures, it could not visualize the event graph for the remaining
13 programs, which never terminated unexpectedly but returned incorrect results. These programs contain semantic faults such
as invalid specifications of operators/variables and invalid writing to message buffers before the completion of
nonblocking communications.

At the localization phase, MPI-PD successfully localized faulty processes for 12 of 15 programs while it
failed to localize them for the remaining three programs. These three programs have calculation faults activated
by all processes at the same statement. Therefore, every process terminated outside the instrumented MPI
routines, so that their trace files contained no information about failure events. Thus, MPI-PD failed to localize
their faulty processes. However, in these cases, since every process terminates without any communication
dependency, error propagation cannot occur. Therefore, developers have to investigate every process. That
is, they have to investigate their programs between the last MPI routine that executed successfully and the next MPI
routine expected to be executed, especially the statements common to all processes.

The 12 programs for which MPI-PD successfully localized faulty processes contained a variety of faults,
classified into the following four types. Note that MPI-PD localizes not the faults themselves but the faulty
processes that activate them.



• Invalid source/destination rank (six programs). 

• Invalid length of message buffer (three programs). 

• Calculation fault (two programs). 

• Deadlock occurred when passing long messages (one program). 



We next confirmed that no faulty process was omitted from the localization results. In all cases where
an invalid source/destination rank was specified, MPI-PD pointed out the deadlocked processes, including the
faulty process. The deadlocked processes pointed out by MPI-PD can therefore include valid processes, so there
is room for improving the accuracy of localization. However, this redundancy was only a minor problem for the
programs in this study: since their faults appear with any number of processes, developers can
scale down the number of processes without missing the activated faults.
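
To illustrate why a deadlock report can include valid processes, the following minimal Python sketch models blocked receives as a wait-for graph and reports every process whose wait chain ends in a cycle. The function name `deadlocked_processes` and the toy ranks are assumptions made for illustration; this is not MPI-PD's actual algorithm.

```python
def deadlocked_processes(waits_for):
    """Return the set of ranks whose wait chain ends in a cycle.

    waits_for maps a blocked rank to the rank it is waiting on.
    Ranks absent from the mapping are assumed to be still making progress.
    """
    deadlocked = set()
    for start in waits_for:
        seen = []
        rank = start
        # Follow the chain until it leaves the blocked set or repeats a rank.
        while rank in waits_for and rank not in seen:
            seen.append(rank)
            rank = waits_for[rank]
        if rank in waits_for:  # the chain re-entered itself: a cycle exists
            deadlocked.update(seen)
    return deadlocked


# Hypothetical run: rank 1 is faulty and posts a receive from rank 2
# instead of rank 0; ranks 2 and 3 are valid but get stuck behind it.
waits_for = {1: 2, 2: 1, 3: 2}
print(sorted(deadlocked_processes(waits_for)))
```

In this toy run the report contains ranks 1, 2, and 3, although only rank 1 activated the fault, which mirrors the redundancy discussed above.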



5.2 Study 2: Scalable debugging with MPI-PD 

We applied MPI-PD to a parallel rendering program [TIH03] running on 64 processes. This program
has a fault in gathering and compositing the rendered images generated by distributed processors. For
high-speed compositing, the developers implemented their own collective communication routines for the
gather and broadcast operations by using the point-to-point routines Send and Recv. Their collective routines
are called at every compositing stage, each of which splits the processes into two groups. That is, given n processes,
each group performs collective communications at the i-th stage, where 1 ≤ i ≤ log n.

Figure 7 shows the event graph for all processes visualized by MPI-PD. While the program generates a
total of 9,774 events, the visualized event graph is composed of only 164 events: 64 failure events and
100 successful events that occurred directly before the failure events. In Figure 7, MPI-PD points out five faulty
processes out of the 64: PE21, PE37, PE44, PE48, and PE52. It also indicates that these five
processes fall into a deadlock and that each of them has one failure event.
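
The reduction described above comes from keeping only failure events and the successful events directly preceding them. A minimal Python sketch of such a filter is given below, assuming a hypothetical per-process trace format of `(routine, ok)` tuples; the format, the function name `reduce_trace`, and the rule of keeping exactly one preceding event are illustrative assumptions, not MPI-PD's actual trace format or selection rule.

```python
def reduce_trace(trace):
    """Keep each failure event and the successful event directly before it.

    trace maps a process name to its ordered list of (routine, ok) events.
    """
    reduced = {}
    for proc, events in trace.items():
        keep = set()
        for i, (_, ok) in enumerate(events):
            if not ok:
                keep.add(i)
                # Also keep the immediately preceding event if it succeeded.
                if i > 0 and events[i - 1][1]:
                    keep.add(i - 1)
        reduced[proc] = [events[i] for i in sorted(keep)]
    return reduced


# Hypothetical toy trace with one failure event per process.
trace = {
    "PE0": [("Send", True), ("Recv", True), ("Send", True), ("Recv", False)],
    "PE1": [("Recv", True), ("Send", True), ("Recv", False)],
}
reduced = reduce_trace(trace)
total_before = sum(len(events) for events in trace.values())
total_after = sum(len(events) for events in reduced.values())
print(total_before, "->", total_after)
```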

As mentioned earlier, MPI-PD allows developers to visualize any subset of processes they choose.
For example, developers can view only the deadlocked processes, as shown in Figure 8, and thus easily see
how the processes fell into the deadlock. They can also add the related processes that communicated with the
deadlocked processes (see Figure 9), and thus see at a glance that process PE48 received many more messages
than the other four faulty processes: PE21, PE37, PE44, and PE52.

In this way, MPI-PD guided the developers to the five failure events, and they easily found that process PE48,
the root process of a broadcast operation, made an excess call to the Send routine because of a missing break
statement. MPI-PD thus assists developers in scalable debugging, where the numbers of processes and events
are too large for them to understand the behavior of the program.

We also note that the buffered send operation complicates fault localization, since it introduces
a gap between the faulty send event and the failure event. For example, when we executed the rendering
program without error detection, process PE48 sent its messages in buffered mode, so it returned successfully
from the faulty Send routine and terminated at a succeeding Recv routine. Without MPI-PD, the developers
may therefore end up investigating the Recv routine, which exhibits only a non-original error, that is, an error
due to propagation. Thus, MPI-PD's run-time error detection is necessary for handling the buffered send operation.
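
The gap between the faulty send and the observed failure can be illustrated with a deliberately simplified two-process model. The Python sketch below is an assumption-laden toy, not a model of MPI semantics in full: the function `first_blocking_call` and its parameters are hypothetical, the sender is assumed to issue its sends followed by one Recv that is never matched, and buffering is treated as unlimited.

```python
def first_blocking_call(sends_posted, recvs_posted, buffered):
    """Return the routine at which the sender gets stuck in the toy model.

    The sender issues `sends_posted` Send calls followed by one Recv that
    is never matched; the receiver posts `recvs_posted` Recv calls.
    """
    for n in range(1, sends_posted + 1):
        matched = n <= recvs_posted
        if not matched and not buffered:
            # An unbuffered send cannot complete without a matching receive.
            return f"Send #{n}"
        # A buffered send copies the message out and returns immediately.
    return "Recv"


# One excess send: three sends issued against two posted receives.
print(first_blocking_call(3, 2, buffered=False))
print(first_blocking_call(3, 2, buffered=True))
```

In the unbuffered case the model stops at the excess Send itself, whereas in the buffered case the failure only becomes visible at the later Recv, which is the situation described for process PE48.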



[Figure 8: event graph of processes PE21, PE37, PE44, PE48, and PE52, each marked as deadlocked. Legend: Send, Recv, Isend, Irecv, coll, Fin.]

Figure 8: Faulty processes isolated by MPI-PD. This graph shows only the faulty processes and the communications
among them.



5.3 Study 3: Comparison with existing debuggers 

To clarify the usability of faulty process localization, we compared MPI-PD with TotalView [Etn02] by applying
both to a complicated program. This program is automatically generated by a parallelizing compiler based on
a task scheduling algorithm, Scheduling with Packaged Point-to-point Communications (SPPC) [YTFH02].

The MPI program generated by SPPC consists of two layers, the calculation layer and the communication layer,
which appear repeatedly during program execution. In the calculation layer, each process independently
performs calculation without any communication. In the communication layer, it exchanges messages by calling
nonblocking communication routines: each process first calls many initiation routines, Isend and Irecv,
and then a completion routine, Waitall. Since the parallelizing compiler mechanically generates large-scale MPI
programs, debugging them is complicated work. Furthermore, since the Waitall routine completes
all of the initiated communications at once, it is time-consuming to distinguish the failed communications from
the many communications completed by the Waitall routine.

Figure 9: Faulty processes and their related processes isolated by MPI-PD. Related processes are those with
which the faulty processes communicate.

Figure 10 shows the visualizations obtained by MPI-PD and TotalView. While MPI-PD visualizes all of the
failure events that occurred on each process and the successful events that occurred directly before them,
TotalView shows pending sends/receives and unexpected messages [CG99, Etn02] at an arbitrary execution
step. Pending sends/receives are sends/receives that have been initiated but not yet matched.
Unexpected messages are messages that have been sent to a process but not yet received.

In this program, every process terminated at a call to the Waitall routine. At termination, the processes
were trying to complete a total of 171 nonblocking operations. For this faulty program, TotalView visualizes 50
pending receives, represented as arrows in Figure 10(b). However, it is time-consuming for developers to
investigate each of the 50 pending receives. MPI-PD, on the other hand, checks every communication for
errors and localizes faulty processes, so that it visualizes only 34 of the 171 events, as shown in Figure 10(a). Since
eight of the 34 events are successfully communicated events, MPI-PD reduces the number of events that have to be
investigated from 171 to 26. Furthermore, it points out that processes PE5 and PE10 fall into a deadlock.
Processes PE5 and PE10 have three and seven error events, respectively, so the number of events that
have to be investigated is reduced further, from 171 to 10.
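
The reduction figures quoted above can be checked with a few lines of Python. The variable names are chosen here for illustration only; the numbers are those reported in this study.

```python
total_ops = 171   # nonblocking operations pending at the Waitall call
visualized = 34   # events shown by MPI-PD
successful = 8    # successfully communicated events among those shown

# Failure events remaining after discarding the successful ones.
failure_events = visualized - successful

# Error events on the two deadlocked processes pointed out by MPI-PD.
deadlock_errors = {"PE5": 3, "PE10": 7}
to_inspect = sum(deadlock_errors.values())

# Share of the original operations that still needs manual inspection.
share_percent = round(100 * to_inspect / total_ops, 1)
print(failure_events, to_inspect, share_percent)
```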

With the assistance of MPI-PD, the developer successfully debugged this program in less than five
minutes. He first investigated process PE5 and confirmed that it had no fault, and then investigated process PE10.
Finally, he found the fault: an invalid source was specified in an Irecv routine.

Table 3 summarizes the differences among MPI-PD, TotalView, and DeWiz [Kra02a, Kra02b]. While MPI-PD
is useful for reducing the events that have to be investigated, TotalView allows us to execute the target program
stepwise. DeWiz also provides an analysis using the event graph. However, DeWiz aims at identifying closely
related processes and reducing the total amount of trace data. Given a specific process, the process grouping
function of DeWiz accumulates the length of transmitted messages for every pair of processes and
isolates related processes by using a certain threshold. Developers therefore have to decide which process
to specify, which is similar to the problem addressed in this paper. Furthermore, since error propagation
has no relevance to message length, the message-length-based approach is inappropriate for the purpose of
faulty process localization.

(a) Event Graph by MPI-PD (b) Message Queue Graph by TotalView

Figure 10: Visualizations obtained by MPI-PD and TotalView.

Table 3: Differences among MPI-PD, TotalView, and DeWiz.

Function                       | MPI-PD                 | DeWiz [Kra02a, Kra02b] | TotalView [Etn02]
-------------------------------+------------------------+------------------------+------------------
1. Faulty process localization | by dependency analysis | -                      | -
2. Run-time error detection    | every message          | every message          | every message
3. Process grouping            | by dependency analysis | by message length      | -
4. Timeline visualization      | yes                    | yes                    | -
5. Trace file reduction        | -                      | yes                    | -
6. Stepwise execution          | -                      | -                      | yes
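
To make the contrast with message-length-based grouping concrete, the following Python sketch accumulates message lengths per pair of processes and selects the partners of a given process whose accumulated traffic reaches a threshold. It follows only the description given in this section; the function name `group_by_message_length`, the message tuple format, and the threshold value are hypothetical, and the sketch is not DeWiz's actual implementation.

```python
from collections import defaultdict


def group_by_message_length(messages, seed, threshold):
    """Return the processes whose accumulated traffic with `seed`
    reaches `threshold`, together with `seed` itself.

    messages is an iterable of (sender, receiver, length_in_bytes).
    """
    volume = defaultdict(int)
    for src, dst, length in messages:
        # Accumulate per unordered pair so that direction does not matter.
        volume[frozenset((src, dst))] += length
    group = {seed}
    for pair, total in volume.items():
        if seed in pair and total >= threshold:
            group |= pair
    return group


# Hypothetical traffic: process 2 exchanges only a short message with process 0.
messages = [(0, 1, 4096), (1, 0, 4096), (0, 2, 8), (2, 3, 65536)]
print(sorted(group_by_message_length(messages, seed=0, threshold=1024)))
```

In this toy input, process 2 is excluded from the group of process 0 because its message is short, even though a short message can propagate an error just as well as a long one.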

In summary, DeWiz is useful for reducing the total amount of trace files, TotalView
is useful for investigating the detailed behavior of programs, and MPI-PD is useful for reducing the number of
events that have to be investigated during debugging. We therefore think that an appropriate combination of these
tools is a good choice for debugging message passing programs. For example, developers can first localize faulty
processes with MPI-PD and then investigate them in detail with TotalView.



6 Conclusions 

We have presented a novel debugging tool, named MPI-PD, for localizing faulty processes in message passing
programs, with the aim of reducing developers' effort. MPI-PD helps developers identify the source of a failure
among a number of observed errors by automatically checking for communication errors during program execution.
If MPI-PD observes any communication error, it generates a trace file, backtraces communication dependencies,
and points out potentially faulty processes in the event graph visualization.

MPI-PD reduces the amount of debugging information before developers visualize it with post-mortem
performance debuggers and investigate it with source-level debuggers. We therefore think that an appropriate
combination of these tools is a good choice for debugging message passing programs.